Factored Language Models Tutorial
Authors
Abstract
The Factored Language Model (FLM) is a flexible framework for incorporating various information sources, such as morphology and part-of-speech, into language modeling. FLMs have so far been successfully applied to tasks such as speech recognition and machine translation; they have the potential to be used in a wide variety of problems that require estimating probability tables from sparse data. This tutorial serves as a comprehensive description of FLMs and related algorithms. We document the FLM functionalities as implemented in the SRI Language Modeling toolkit and provide an introductory walk-through using FLMs on an actual dataset. Our goal is to provide an easy-to-understand tutorial and reference for researchers interested in applying FLMs to their problems.
Overview of the Tutorial
We first describe the factored language model (Section 1) and generalized backoff (Section 2), two complementary techniques that attempt to improve statistical estimation (i.e., reduce parameter variance) in language models, and that also attempt to better describe the way in which language (and sequences of words) might be produced. Researchers familiar with the algorithms behind FLMs may skip to Section 3, which describes the FLM programs and file formats in the publicly available SRI Language Modeling (SRILM) toolkit. Section 4 is a step-by-step walkthrough with several FLM examples on a real language modeling dataset. This may be useful for beginning users of FLMs. Finally, Section 5 discusses the problem of automatically tuning FLM parameters on real datasets and refers to existing software. This may be of interest to advanced users of FLMs.
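To make the two ideas above concrete, the following is a minimal sketch of a factored view of text and backoff over factors: each token carries a word factor and a POS factor, and when a word-based context is unseen we drop down to the coarser POS factor before falling back to the unigram. The tiny corpus, the factor inventory, and the unsmoothed counting are all hypothetical illustrations, not the SRILM implementation (which adds proper discounting and normalization).

```python
from collections import Counter

# Toy factored corpus: each token carries two factors, word (W) and POS (P).
# Data and factor set are made up for illustration.
corpus = [("the", "DT"), ("dog", "NN"), ("barks", "VB"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VB")]

pairs = list(zip(corpus, corpus[1:]))
c_word_ctx  = Counter((pw, w) for (pw, _), (w, _) in pairs)  # count(w_{t-1}, w_t)
c_word_prev = Counter(pw for (pw, _), _ in pairs)            # count(w_{t-1})
c_pos_ctx   = Counter((pp, w) for (_, pp), (w, _) in pairs)  # count(p_{t-1}, w_t)
c_pos_prev  = Counter(pp for (_, pp), _ in pairs)            # count(p_{t-1})
c_uni       = Counter(w for w, _ in corpus)                  # count(w)

def p_backoff(w, prev_word, prev_pos):
    """P(w | previous token), backing off word -> POS factor -> unigram."""
    if c_word_ctx[(prev_word, w)] > 0:
        return c_word_ctx[(prev_word, w)] / c_word_prev[prev_word]
    if c_pos_ctx[(prev_pos, w)] > 0:       # drop the word factor, keep POS
        return c_pos_ctx[(prev_pos, w)] / c_pos_prev[prev_pos]
    return c_uni[w] / len(corpus)          # final fallback: unigram

print(p_backoff("cat", "the", "DT"))  # seen word context
print(p_backoff("dog", "a", "DT"))    # unseen word context, backs off to POS
```

Generalized backoff extends this single fixed path (word, then POS) to arbitrary orders and combinations of parent factors, which is what the FLM framework lets one specify and tune.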
Related Papers
Morphology-based language modeling for arabic speech recognition
Language modeling is a difficult problem for languages with rich morphology. In this paper we investigate the use of morphology-based language models at different stages in a speech recognition system for conversational Arabic. Class-based and single-stream factored language models using morphological word representations are applied within an N-best list rescoring framework. In addition, we exp...
Fast Exact Inference with a Factored Model for Natural Language Parsing
We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unl...
Rescoring n-best lists for Russian speech recognition using factored language models
In this paper, we present research on factored language models (FLMs) for rescoring N-best lists in a Russian speech recognition task. As a baseline we used a 3-gram language model. Both the baseline and the factored language models were trained on a text corpus collected from recent news texts on Internet sites of online newspapers; the total size of the corpus is about 350 million words (2.4...
Morpheme-Based Language Modeling for Amharic Speech Recognition
This paper presents the application of morpheme-based and factored language models in an Amharic speech recognition task. Since using morphemes in both acoustic and language models mostly results in performance degradation due to acoustic confusability, and since it is problematic to use factored language models in standard word decoders, we applied the models in a lattice rescoring framework....
Factored Neural Language Models
Language models based on a continuous word representation and neural network probability estimation have recently emerged as an alternative to the established backoff language models. At the same time, factored language models have been developed that use additional word information (such as parts-of-speech, morphological classes, and syntactic features) in conjunction with refined back-off str...